
    Approximating Weighted Duo-Preservation in Comparative Genomics

    Motivated by comparative genomics, Chen et al. [9] introduced the Maximum Duo-preservation String Mapping (MDSM) problem, in which we are given two strings $s_1$ and $s_2$ over the same alphabet and the goal is to find a mapping $\pi$ between them that maximizes the number of duos preserved. A duo is any two consecutive characters in a string, and it is preserved by the mapping if its two consecutive characters in $s_1$ are mapped to the same two consecutive characters in $s_2$. The MDSM problem is known to be NP-hard and there are approximation algorithms for it [3, 5, 13], but all of them consider only the "unweighted" version of the problem, in the sense that a duo from $s_1$ is preserved by mapping it to any identical duo in $s_2$ regardless of their positions in the respective strings. However, in comparative genomics it is desirable to find mappings that favor preserving duos that are "closer" to each other under some distance measure [19]. In this paper, we introduce a generalized version of the problem, called the Maximum-Weight Duo-preservation String Mapping (MWDSM) problem, that captures both duo-preservation and duo-distance measures, in the sense that mapping a duo from $s_1$ to each preserved duo in $s_2$ has a weight indicating the "closeness" of the two duos. The objective of the MWDSM problem is to find a mapping that maximizes the total weight of preserved duos. We give a polynomial-time 6-approximation algorithm for this problem. Comment: Appeared in the proceedings of the 23rd International Computing and Combinatorics Conference (COCOON 2017).
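    To make the notion of a preserved duo concrete, the following is a minimal sketch of the unweighted objective only. It is not the approximation algorithm from the paper. It assumes that both strings have the same length and the same multiset of characters, and that a mapping is given as a list `pi` where `pi[i]` is the position in the second string assigned to position `i` of the first. The function names are hypothetical, and the exhaustive search is only usable for very short strings.

```python
from itertools import permutations


def preserved_duos(s1, s2, pi):
    """Count the duos of s1 that the mapping pi preserves in s2."""
    # Assumption: pi is a bijection that sends each character to an equal one.
    assert sorted(pi) == list(range(len(s2)))
    assert all(s1[i] == s2[pi[i]] for i in range(len(s1)))
    # The duo (i, i+1) is preserved when its two characters land on
    # consecutive positions of s2, in the same order.
    return sum(1 for i in range(len(s1) - 1) if pi[i + 1] == pi[i] + 1)


def max_preserved_duos(s1, s2):
    """Exhaustive search over all valid mappings (exponential time)."""
    best = 0
    for pi in permutations(range(len(s2))):
        if all(s1[i] == s2[pi[i]] for i in range(len(s1))):
            best = max(best, preserved_duos(s1, s2, pi))
    return best


if __name__ == "__main__":
    # "abc" is mapped to the end of s2 and "ab" to its start.
    print(preserved_duos("abcab", "ababc", [2, 3, 4, 0, 1]))  # 3
    print(max_preserved_duos("abcab", "ababc"))               # 3
```

    In the weighted variant described above, each preserved duo would contribute its weight to the sum instead of 1.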

    Forty years of The Selfish Gene are not enough


    Comparing biological networks via graph compression

    Background: Comparison of various kinds of biological data is one of the main problems in bioinformatics and systems biology. Data compression methods have been applied to comparison of large sequence data and protein structure data. Since it is still difficult to compare global structures of large biological networks, it is reasonable to try to apply data compression methods to comparison of biological networks. In existing compression methods, the uniqueness of compression results is not guaranteed because there is some ambiguity in selection of overlapping edges. Results: This paper proposes novel efficient methods, CompressEdge and CompressVertices, for comparing large biological networks. In the proposed methods, an original network structure is compressed by iteratively contracting identical edges and sets of connected edges. Then, the similarity of two networks is measured by a compression ratio of the concatenated networks. The proposed methods are applied to comparison of metabolic networks of several organisms, H. sapiens, M. musculus, A. thaliana, D. melanogaster, C. elegans, E. coli, S. cerevisiae, and B. subtilis, and are compared with an existing method. These results suggest that our methods can efficiently measure the similarities between metabolic networks. Conclusions: Our proposed algorithms, which compress node-labeled networks, are useful for measuring the similarity of large biological networks.
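    The idea of scoring similarity by how well two concatenated networks compress can be illustrated with a rough sketch. It does not implement CompressEdge or CompressVertices: as a stand-in, it serializes labeled edge lists to text and uses the general-purpose `zlib` compressor. The ratio definition (compressed size divided by raw size), the function names, and the toy networks are all assumptions made for illustration.

```python
import zlib


def serialize(edges):
    """Turn a list of labeled edges into a canonical byte string."""
    return "\n".join(sorted(f"{u}-{v}" for u, v in edges)).encode()


def concat_compression_ratio(edges_a, edges_b):
    """Compressed size over raw size of the two serialized networks joined.

    A lower value suggests more shared structure (assumed definition).
    """
    data = serialize(edges_a) + b"\n" + serialize(edges_b)
    return len(zlib.compress(data, 9)) / len(data)


# Toy networks: net_b duplicates net_a, net_c uses unrelated labels.
net_a = [(f"n{i}", f"n{i + 1}") for i in range(200)]
net_b = list(net_a)
net_c = [(f"m{(i * 37) % 1009}", f"m{(i * 91) % 1013}") for i in range(200)]

if __name__ == "__main__":
    print(concat_compression_ratio(net_a, net_b))
    print(concat_compression_ratio(net_a, net_c))
```

    With these toy inputs, the pair of identical networks should yield the smaller ratio, because the second copy is encoded almost entirely as back-references to the first.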

    Identification and characterisation of tomato torrado virus, a new plant picorna-like virus from tomato

    A new virus was isolated from tomato plants from the Murcia region in Spain which showed symptoms of ‘torrado disease’: very distinct necrotic, almost burn-like symptoms on leaves of infected plants. The virus particles are isometric with a diameter of approximately 28 nm. The viral genome consists of two (+)ssRNA molecules of 7793 nt (RNA1) and 5389 nt (RNA2). RNA1 contains one open reading frame (ORF) encoding a predicted polyprotein of 241 kDa that shows conserved regions with motifs typical for a protease-cofactor, a helicase, a protease and an RNA-dependent RNA polymerase. RNA2 contains two partially overlapping ORFs potentially encoding proteins of 20 and 134 kDa. These viral RNAs are encapsidated by three proteins with estimated sizes of 35, 26 and 23 kDa. Direct protein sequencing mapped these coat proteins to ORF2 on RNA2. Phylogenetic analyses of nucleotide and derived amino acid sequences showed that the virus is related to but distinct from viruses belonging to the genera Sequivirus, Sadwavirus and Cheravirus. This new virus, for which the name tomato torrado virus is proposed, most likely represents a member of a new plant virus genus.

    Differential metabolism of Mycoplasma species as revealed by their genomes

    The annotation and comparative analyses of the genomes of Mycoplasma synoviae and Mycoplasma hyopneumoniae, as well as of other Mollicutes (a group of bacteria devoid of a rigid cell wall), have set the grounds for a global understanding of their metabolism and infection mechanisms. According to the annotation data, M. synoviae and M. hyopneumoniae are able to perform glycolytic metabolism, but do not possess the enzymatic machinery for citrate and glyoxylate cycles, gluconeogenesis and the pentose phosphate pathway. Both can synthesize ATP by lactic fermentation, but only M. synoviae can convert acetaldehyde to acetate. Also, our genome analysis revealed that M. synoviae and M. hyopneumoniae are not expected to synthesize polysaccharides, but they can take up a variety of carbohydrates via the phosphoenolpyruvate-dependent phosphotransferase system (PEP-PTS). Our data showed that these two organisms are unable to synthesize purine and pyrimidine de novo, since they only possess the sequences which encode salvage pathway enzymes. Comparative analyses of M. synoviae and M. hyopneumoniae with other Mollicutes have revealed differential genes in the former two genomes coding for enzymes that participate in carbohydrate, amino acid and nucleotide metabolism and host-pathogen interaction. The identification of these metabolic pathways will provide a better understanding of the biology and pathogenicity of these organisms.

    Genome Sizes and the Benford Distribution

    BACKGROUND: Data on the number of Open Reading Frames (ORFs) coded by genomes from the 3 domains of Life show the presence of some notable general features. These include essential differences between the Prokaryotes and Eukaryotes, with the number of ORFs growing linearly with total genome size for the former, but only logarithmically for the latter. RESULTS: Simply by assuming that the (protein) coding and non-coding fractions of the genome must have different dynamics and that the non-coding fraction must be particularly versatile and therefore be controlled by a variety of (unspecified) probability distribution functions (pdf's), we are able to predict that the number of ORFs for Eukaryotes follows a Benford distribution and must therefore have a specific logarithmic form. Using the data for the 1000+ genomes available to us in early 2010, we find that the Benford distribution provides excellent fits to the data over several orders of magnitude. CONCLUSIONS: In its linear regime the Benford distribution produces excellent fits to the Prokaryote data, while the full non-linear form of the distribution similarly provides an excellent fit to the Eukaryote data. Furthermore, in their region of overlap the salient features are statistically congruent. This allows us to interpret the difference between Prokaryotes and Eukaryotes as the manifestation of the increased demand in the biological functions required for the larger Eukaryotes, to estimate some minimal genome sizes, and to predict a maximal Prokaryote genome size on the order of 8-12 megabasepairs. These results naturally allow a mathematical interpretation in terms of maximal entropy and, therefore, most efficient information transmission

    OrthoSelect: a protocol for selecting orthologous groups in phylogenomics

    Background: Phylogenetic studies using expressed sequence tags (EST) are becoming a standard approach to answer evolutionary questions. Such studies are usually based on large sets of newly generated, unannotated, and error-prone EST sequences from different species. A first crucial step in EST-based phylogeny reconstruction is to identify groups of orthologous sequences. From these data sets, appropriate target genes are selected, and redundant sequences are eliminated to obtain suitable sequence sets as input data for tree-reconstruction software. Generating such data sets manually can be very time consuming. Thus, software tools are needed that carry out these steps automatically. Results: We developed a flexible and user-friendly software pipeline, running on desktop machines or computer clusters, that constructs data sets for phylogenomic analyses. It automatically searches assembled EST sequences against databases of orthologous groups (OG), assigns ESTs to these predefined OGs, translates the sequences into proteins, eliminates redundant sequences assigned to the same OG, creates multiple sequence alignments of identified orthologous sequences and offers the possibility to further process this alignment in a last step by excluding potentially homoplastic sites and selecting sufficiently conserved parts. Our software pipeline can be used as it is, but it can also be adapted by integrating additional external programs. This makes the pipeline useful for non-bioinformaticians as well as for bioinformatics experts. The software pipeline is especially designed for ESTs, but it can also handle protein sequences. Conclusion: OrthoSelect is a tool that produces orthologous gene alignments from assembled ESTs. Our tests show that OrthoSelect detects orthologs in EST libraries with high accuracy. In the absence of a gold standard for orthology prediction, we compared predictions by OrthoSelect to a manually created and published phylogenomic data set. Our tool was not only able to rebuild the data set with a specificity of 98%, but it also detected four percent more orthologous sequences. Furthermore, the results OrthoSelect produces are in absolute agreement with the results of other programs, but our tool offers a significant speedup and additional functionality, e.g. handling of ESTs, computing sequence alignments, and refining them. To our knowledge, there is currently no fully automated and freely available tool for this purpose. Thus, OrthoSelect is a valuable tool for researchers in the field of phylogenomics who deal with large quantities of EST sequences. OrthoSelect is written in Perl and runs on Linux/Mac OS X.

    Core Proteome of the Minimal Cell: Comparative Proteomics of Three Mollicute Species

    Mollicutes (mycoplasmas) have been recognized as highly evolved prokaryotes with an extremely small genome size and very limited coding capacity. Thus, they may serve as a model of a ‘minimal cell’: a cell with the lowest possible number of genes yet capable of autonomous self-replication. We present the results of a comparative analysis of proteomes of three mycoplasma species: A. laidlawii, M. gallisepticum, and M. mobile. The core proteome components found in the three mycoplasma species are involved in fundamental cellular processes which are necessary for the free living of cells. They include replication, transcription, translation, and minimal metabolism. The members of the proteome core seem to be tightly interconnected, with a number of interactions forming a core interactome, whether or not additional species-specific proteins are located on the periphery. We also obtained a genome core of the respective organisms and compared it with the proteome core. It was found that the genome core encodes 73 more proteins than the proteome core. Apart from proteins which may not be identified due to technical limitations, there are 24 proteins that seem not to be expressed under the optimal conditions.

    Enzymes Are Enriched in Bacterial Essential Genes

    Essential genes, those indispensable for the survival of an organism, play a key role in the emerging field of synthetic biology. Characterization of functions encoded by essential genes not only has important practical implications, such as in identifying antibiotic drug targets, but can also enhance our understanding of basic biology, such as functions needed to support cellular life. Enzymes are critical for almost all cellular activities. However, essential genes have not been systematically examined from the aspect of enzymes and the chemical reactions that they catalyze. Here, by comprehensively analyzing essential genes in 14 bacterial genomes in which large-scale gene essentiality screens have been performed, we found that enzymes are enriched in essential genes. Among essential enzymes, ligases (especially those forming carbon-oxygen bonds and carbon-nitrogen bonds), nucleotidyltransferases and phosphotransferases are overrepresented, while oxidoreductases are underrepresented. Furthermore, essential enzymes tend to associate with more gene ontology domains. These results, from the aspect of chemical reactions, provide further insights into the understanding of functions needed to support natural cellular life, as well as synthetic cells, and provide additional parameters that can be integrated into gene essentiality prediction algorithms.